Statistical Analysis of Mandarin Acoustic Units and Automatic Extraction of Phonetically Rich Sentences Based Upon a Very Large Chinese Text Corpus
Abstract
Automatic speech recognition can provide humans with the most convenient method of communicating with computers. Because Chinese is not an alphabetic language and entering Chinese characters into a computer is difficult, Mandarin speech recognition is highly desirable. High-performance speech recognition systems have recently begun to emerge from research institutes. However, an adequate speech database for training acoustic models and evaluating performance is believed to be critical for successfully deploying such systems in realistic operating environments. Designing a set of phonetically rich sentences for efficiently training and evaluating a speech recognition system has therefore become very important. This paper first presents a statistical analysis of various Mandarin acoustic units based on a very large Chinese text corpus collected from daily newspapers, and then presents an algorithm that automatically extracts phonetically rich sentences from that corpus for use in training and evaluating a Mandarin speech recognition system.
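The abstract does not describe the extraction algorithm itself, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes each sentence has already been converted to a list of acoustic units (here, hypothetical toneless syllable labels), counts unit frequencies, and then uses a generic greedy coverage heuristic: repeatedly pick the sentence that adds the most units not yet covered. The names `unit_counts`, `select_rich_sentences`, and the toy `corpus` are illustrative assumptions.

```python
from collections import Counter

# Toy corpus (hypothetical): sentence id -> list of acoustic units.
corpus = {
    "s1": ["ni", "hao", "ma"],
    "s2": ["ni", "hao"],
    "s3": ["wo", "hen", "hao"],
    "s4": ["ma", "wo"],
}


def unit_counts(sentences):
    """Count how often each acoustic unit occurs across all sentences."""
    counts = Counter()
    for units in sentences.values():
        counts.update(units)
    return counts


def select_rich_sentences(sentences, max_sentences=None):
    """Greedily pick sentences until every unit in the corpus is covered.

    This is a generic set-cover heuristic (an assumption, not the
    paper's algorithm). Ties are broken by sentence id order.
    """
    uncovered = set()
    for units in sentences.values():
        uncovered.update(units)

    remaining = dict(sentences)
    chosen = []
    while uncovered and remaining:
        if max_sentences is not None and len(chosen) >= max_sentences:
            break
        # Sentence that contributes the most still-uncovered units.
        best = max(sorted(remaining),
                   key=lambda sid: len(uncovered & set(remaining[sid])))
        gain = uncovered & set(remaining[best])
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
        del remaining[best]
    return chosen


if __name__ == "__main__":
    print(unit_counts(corpus).most_common())
    print(select_rich_sentences(corpus))
```

A real system would likely weight units by their corpus frequency or by rarity rather than treating all units equally; that choice is left open here because the abstract does not specify it.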
Similar resources
A set of corpus-based text-to-speech synthesis technologies for Mandarin Chinese
This paper presents a set of corpus-based text-to-speech synthesis technologies for Mandarin Chinese. A large speech corpus produced by a single speaker is used, and the speech output is synthesized from waveform units of variable lengths, with desired linguistic properties, retrieved from this corpus. Detailed methodologies were developed for designing “phonetically rich” and “prosodically ric...
Automatic selection of phonetically distributed sentence sets for speaker adaptation with application to large vocabulary Mandarin speech recognition
This paper presents an approach to the automatic selection of phonetically distributed sentence sets for speaker adaptation, and applies the concept to the task of Mandarin speech recognition with a very large vocabulary. This is a different approach to the adaptation data selection problem. A computer algorithm is developed to select minimum sets of phonetically distributed training sentences from a...
Automatic extraction of phonetically rich sentences from large text corpus of Indian languages
A set of phonetically rich sentences is a requirement for representing different speech units, to be used for developing Automatic Speech Recognition and Speech Synthesis Systems. Selecting such a set from a large text corpus without modifying the characteristics of the corpus is still a difficult task. A major concern in this process is to decide on what basis sentences must be chosen so that ...
iCALL corpus: Mandarin Chinese spoken by non-native speakers of European descent
We present iCALL, a speech corpus designed to evaluate Mandarin Chinese pronunciation patterns of non-native speakers of European descent, developed at the Institute for Infocomm Research (I²R) in Singapore. To the best of our knowledge, iCALL is larger than any reported non-native corpora to date in terms of utterance number, duration, and number of speakers: iCALL consists of 90,841 utterances...
SingaKids-Mandarin: Speech Corpus of Singaporean Children Speaking Mandarin Chinese
We present SingaKids-Mandarin, a speech corpus of 255 Singaporean children aged 7 to 12 reading Mandarin Chinese, for a total of 125 hours of data (75 hours of speech) and 79,843 utterances. This corpus is phonetically balanced and detailed in human annotations, including phonetic transcriptions, lexical tone markings, and proficiency scoring at the utterance level. The reading scripts span a d...
Journal title:
- IJCLCLP
Volume 3, Issue
Pages -
Publication date 1998